12  Regular expressions

Author

Vladimir Buskin, Thomas Brunner

12.1 Regular expressions

Regular expressions (or ‘regex’) help us find more complex patterns in strings of text. Suppose we are interested in finding all inflectional forms of the lemma PROVIDE in a corpus, i.e., provide, provides, providing and provided. Instead of searching for each form individually, we can construct a regular expression of the form

\[ \text{provid(es|ing|ed)?} \]

which can be read as ‘Match the sequence of letters <provid>, optionally followed by the letters <es> or <ing> or <ed>’. Notice how optionality is signified by the ‘?’ operator and alternatives by ‘|’. Because the pattern may match anywhere inside a word, <provid> on its own also covers provide.
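To see what this pattern does before running it on a corpus, it can be tried on a handful of words in base R. The following is a minimal sketch; the word list is invented for illustration and is not taken from ICE-GB.

```r
# Hypothetical mini word list (not corpus data)
words <- c("provide", "provides", "providing", "provided", "proved")

pattern <- "provid(es|ing|ed)?"

# grepl() returns TRUE for every element that contains a match
hits <- grepl(pattern, words)
hits

# regmatches() shows which part of each word was actually matched;
# elements without a match are dropped
matched <- regmatches(words, regexpr(pattern, words))
matched
```

Note that for provide only the stem provid is matched: the expression does not have to cover the whole word.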

To activate regular expressions in a KWIC query, simply set the valuetype argument to "regex":

# Load library and corpus
library(quanteda)
ICE_GB <- readRDS("ICE_GB.RDS")

# Perform query
kwic_provide <- kwic(ICE_GB,
                     phrase("provid(es|ing|ed)?"),
                     valuetype = "regex",
                     window = 20)

The number of hits has more than doubled. However, upon closer inspection, we’ll notice a few false positives, such as providential, provider and providers:

table(kwic_provide$keyword)

      provid      provide     provided     Provided    Provident providential 
           1          165          118            5            1            1 
    provider    providers     provides    providing    Providing 
           1            3           72           52            1 

There are two ways to handle this:

  1. Refine the search expression further to only match those cases of interest.
  2. Manually sort out irrelevant cases during annotation in your spreadsheet software.

As a rule of thumb, you should consider improving your search expression if you receive hundreds or even thousands of false hits. If there are only a couple of false positives, it’s usually easier to simply mark them as “irrelevant” in your spreadsheet.
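To put this rule of thumb into numbers, the share of false positives can be estimated directly from the frequency table above. The sketch below copies the counts by hand and assumes that only provide, provides, providing and provided (in any capitalisation) count as relevant; the stray form provid is treated as irrelevant, which is a judgment call.

```r
# Keyword frequencies copied from the table above
counts <- c(provid = 1, provide = 165, provided = 118, Provided = 5,
            Provident = 1, providential = 1, provider = 1, providers = 3,
            provides = 72, providing = 52, Providing = 1)

# Forms of PROVIDE we are interested in (assumption: case is ignored)
targets <- c("provide", "provides", "providing", "provided")
is_relevant <- tolower(names(counts)) %in% targets

false_hits <- sum(counts[!is_relevant])
total_hits <- sum(counts)

# Percentage of hits that are false positives
share <- round(100 * false_hits / total_hits, 1)
share
```

With 7 irrelevant hits out of 420 (about 1.7 %), marking them by hand would be quicker here than rewriting the query.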

Task

How could you refine the search expression for PROVIDE to get rid of the irrelevant cases? Consult the RegEx Cheatsheet below!

Solution:
# Add word boundary with \\b
kwic_provide2 <- kwic(ICE_GB,
                      phrase("\\bprovid(e|es|ing|ed)\\b"),
                      valuetype = "regex",
                      window = 20)

table(kwic_provide2$keyword)
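
The effect of the word boundaries can also be checked without the corpus. This is a small base R sketch on an invented word list that includes the false positives from above; `ignore.case = TRUE` is set here because the corpus hits include capitalised forms.

```r
# Hypothetical word list, including the earlier false positives
words <- c("provide", "Provided", "provides", "providing",
           "provider", "providers", "providential", "provid")

pattern <- "\\bprovid(e|es|ing|ed)\\b"

# Keep only the words that match the refined pattern
keep <- grepl(pattern, words, ignore.case = TRUE, perl = TRUE)
kept_words <- words[keep]
kept_words
```

Only the four inflectional forms survive, because the closing `\\b` requires the word to end right after the suffix.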

12.2 A RegEx Cheatsheet

12.2.1 Basic functions

Command   Definition                      Example   Finds
python    Literal sequence of characters  python    python
.         Any character                   .ython    aython, bython…
12.2.2 Character classes and alternatives

Command     Definition                          Example         Finds
[abc]       Class of characters                 [jp]ython       jython, python
[^abc]      Excluded class of characters        [^pP]ython      aython, jython… but not python, Python
(...|...)   Alternatives linked by logical or   P(ython|eter)   Python, Peter
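
The three commands in this table can be compared side by side in base R. The word list below is made up for illustration.

```r
# Hypothetical test words
words <- c("python", "Python", "jython", "Peter", "Pyter")

# Character class: j or p directly before "ython"
in_class <- grepl("[jp]ython", words)

# Excluded class: any character except p or P before "ython"
not_p <- grepl("[^pP]ython", words)

# Alternatives: "P" followed by either "ython" or "eter"
alternatives <- grepl("P(ython|eter)", words)

# One row per word, one column per pattern
data.frame(words, in_class, not_p, alternatives)
```

Since base R matching is case-sensitive by default, Python is not matched by `[jp]ython`.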

12.2.3 Pre-defined character classes

Command   Definition                                  Example        Finds
\\w       All alphanumeric characters                                A-Z, a-z, 0-9 (and the underscore _)
\\W       All non-alphanumeric characters                            everything but A-Z, a-z, 0-9, _
\\d       Any decimal digit                                          0-9
\\D       Everything which is not a decimal digit                    everything but 0-9
\\s       Whitespace (space, tab, line break)
\\b       Word boundary                               \\bpython\\b   python as a whole word
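
A short base R sketch of the pre-defined classes follows, using an invented example sentence. The `+` after a class (one or more repetitions) is a quantifier and is combined with the classes here only to extract whole runs of characters.

```r
# Hypothetical example sentence
text <- "Room 101 opens at 9 am"

# \\d+ : every run of digits
digits <- regmatches(text, gregexpr("\\d+", text, perl = TRUE))[[1]]

# \\w+ : every run of word characters
tokens <- regmatches(text, gregexpr("\\w+", text, perl = TRUE))[[1]]

# \\D : delete everything that is not a digit
only_digits <- gsub("\\D", "", text, perl = TRUE)

# \\s : replace each whitespace character with an underscore
no_spaces <- gsub("\\s", "_", text, perl = TRUE)

# \\b : "am" as a whole word, but not inside "amber"
whole_word <- grepl("\\bam\\b", c("9 am", "amber"), perl = TRUE)
```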

12.2.4 Quantifiers

Command   Definition                                      Example       Finds
?         One or zero instances of the preceding symbol   Py?thon       Python, Pthon
*         Any number of instances, including zero         Py*thon       Python, Pthon, Pyyyython…
                                                          P[Yy]*thon    Python, Pthon, PyYYython…
+         Any number of instances, but at least one       Py+thon       Python, Pyyython, Pyyyython
{1,3}     Between min and max instances, {min,max}        Py{1,3}thon   Python, Pyython, Pyyython
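
The four quantifiers can be contrasted on the same set of strings in base R. In this sketch the anchors `^` (start of string) and `$` (end of string) are added, although they are not part of the cheatsheet, so that each pattern has to cover the whole word; the test words are invented.

```r
# Hypothetical test words with zero to four <y>
words <- c("Pthon", "Python", "Pyython", "Pyyython", "Pyyyython")

zero_or_one  <- grepl("^Py?thon$", words)      # ?     : 0 or 1 <y>
zero_or_more <- grepl("^Py*thon$", words)      # *     : any number of <y>
one_or_more  <- grepl("^Py+thon$", words)      # +     : at least 1 <y>
one_to_three <- grepl("^Py{1,3}thon$", words)  # {1,3} : 1 to 3 <y>

# One row per word, one column per quantifier
data.frame(words, zero_or_one, zero_or_more, one_or_more, one_to_three)
```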

12.3 Exercises

  1. Find all names of months!
  2. Write an elegant regular expression which finds sing, sang and sung.
  3. Find all four-digit numbers in the corpus!
  4. Write an elegant regular expression which finds all inflectional forms of swim!